Get data

In this tutorial, we will be using 3 publicly available dataset downloaded from 10X Genomics repository. They can be downloaded using the following bash commands. Simply create a folder called data and then use curl to pull the data from the 10X database.

mkdir data; cd data
curl -O http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v2/pbmc_1k_v2_filtered_feature_bc_matrix.h5
curl -O http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.h5
curl -O http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5
cd ..

With data in place, now we can start loading libraries we will use in this tutorial.

suppressMessages(require(Seurat))
suppressMessages(require(scater))
suppressMessages(require(Matrix))

We can first load the data individually by reading directly from HDF5 file format (.h5). Note that among those , the dataset p3.1k actually has both gene expression and CITE-seq data, so we will use only the Gene Expression here.

v3.1k <- Read10X_h5("data/pbmc_1k_v3_filtered_feature_bc_matrix.h5", use.names = T)
v2.1k <- Read10X_h5("data/pbmc_1k_v2_filtered_feature_bc_matrix.h5", use.names = T)
p3.1k <- Read10X_h5("data/pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5", use.names = T)
## Genome matrix has multiple modalities, returning a list of matrices for this genome
p3.1k <- p3.1k$`Gene Expression`

Create Object

We can now load the expression matricies into objects and then merge them into a single merged object. Each analysis workflow (Seurat, Scater, Scranpy, etc) has its own way of storing data. We will add dataset labels as cell.ids just in case you have overlapping barcodes between the datasets. After that we add a column Chemistry in the metadata for plotting later on.

sdata.v2.1k <- CreateSeuratObject(v2.1k, project = "v2.1k")
sdata.v3.1k <- CreateSeuratObject(v3.1k, project = "v3.1k")
sdata.p3.1k <- CreateSeuratObject(p3.1k, project = "p3.1k")

# Merge datasets into one single seurat object
alldata <- merge(sdata.v2.1k, c(sdata.v3.1k,sdata.p3.1k), add.cell.ids=c("v2.1k","v3.1k","p3.1k"))

# Add in a metadata column that indicates v2 vs v3 chemistry
alldata$Chemistry <- ifelse(alldata$orig.ident == "v2.1k","v2","v3")

Here it is how the count matrix and the metatada look like for every cell.

as.data.frame(alldata@assays$RNA@counts[1:10,1:2])
head(alldata@meta.data,10)
##             v2.1k_AAACCTGAGCGCTCCA-1 v2.1k_AAACCTGGTGATAAAC-1
## MIR1302-2HG                        0                        0
## FAM138A                            0                        0
## OR4F5                              0                        0
## AL627309.1                         0                        0
## AL627309.3                         0                        0
## AL627309.2                         0                        0
## AL627309.4                         0                        0
## AL732372.1                         0                        0
## OR4F29                             0                        0
## AC114498.1                         0                        0
##                          orig.ident nCount_RNA nFeature_RNA Chemistry
## v2.1k_AAACCTGAGCGCTCCA-1      v2.1k       6631         2029        v2
## v2.1k_AAACCTGGTGATAAAC-1      v2.1k       2196          881        v2
## v2.1k_AAACGGGGTTTGTGTG-1      v2.1k       2700          791        v2
## v2.1k_AAAGATGAGTACTTGC-1      v2.1k       3551         1183        v2
## v2.1k_AAAGCAAGTCTCTTAT-1      v2.1k       3080         1333        v2
## v2.1k_AAAGCAATCCACGAAT-1      v2.1k       5769         1556        v2
## v2.1k_AAAGTAGGTAGCAAAT-1      v2.1k       4655         1221        v2
## v2.1k_AAATGCCGTCTAGAGG-1      v2.1k       5230         1803        v2
## v2.1k_AACACGTCACCTCGGA-1      v2.1k       3735         1129        v2
## v2.1k_AACACGTCATGGTTGT-1      v2.1k       3988         1347        v2

Calculate QC

Having the data in a suitable format, we can start calculating some quality metrics. We can for example calculate the percentage of mitocondrial and ribossomal genes per cell and add to the metadata. This will be helpfull to visualize them across different metadata parameteres (i.e. datasetID and chemistry version). There are several ways of doing this, and here manually calculate the proportion of mitochondrial reads and add to the metadata table.

# Way1: Doing it using Seurat function
alldata <- PercentageFeatureSet(alldata, "^MT-", col.name = "percent_mito")

# Way2: Doing it manually
total_counts_per_cell <- colSums(  alldata@assays$RNA@counts  )

mito_genes <- rownames(alldata)[grep("^MT-",rownames(alldata))]
head(mito_genes,10)
alldata$percent_mito <- colSums(  alldata@assays$RNA@counts[mito_genes,]  ) / total_counts_per_cell
##  [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##  [8] "MT-ND3"  "MT-ND4L" "MT-ND4"

In the same manner we will calculate the proportion gene expression that comes from ribosomal proteins.

# Way1: Doing it using Seurat function
alldata <- PercentageFeatureSet(alldata, "^RP[SL]", col.name = "percent_ribo")

# Way2: Doing it manually
ribo_genes <- rownames(alldata)[grep("^RP[SL]",rownames(alldata))]
head(ribo_genes,10)
alldata$percent_ribo <- colSums(  alldata@assays$RNA@counts[ribo_genes,]  ) / total_counts_per_cell
##  [1] "RPL22"   "RPL11"   "RPS6KA1" "RPS8"    "RPL5"    "RPS27"   "RPS6KC1"
##  [8] "RPS7"    "RPS27A"  "RPL31"

Plot QC

Now we can plot some of the QC-features as violin plots.

feats <- c("nFeature_RNA","nCount_RNA","percent_mito","percent_ribo")
VlnPlot(alldata, group.by= "orig.ident", features = feats, pt.size = 0.1,ncol = 4) + NoLegend()

VlnPlot(alldata, group.by= "Chemistry", features = feats, pt.size = 0.1,ncol = 4) + NoLegend()

As you can see, the v2 chemistry gives lower gene detection, but higher detection of ribosomal proteins. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected. And we can plot the different QC-measures as scatter plots.

cowplot::plot_grid(
  FeatureScatter(alldata, "nCount_RNA"  , "nFeature_RNA", group.by = "orig.ident"),
  FeatureScatter(alldata, "percent_mito", "nFeature_RNA", group.by = "orig.ident"),
  FeatureScatter(alldata, "percent_ribo", "nFeature_RNA", group.by = "orig.ident"),
  FeatureScatter(alldata, "percent_ribo", "percent_mito", group.by = "orig.ident")
)


Filtering

Mitocondrial filtering

We have quite a lot of cells with high proportion of mitochondrial reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent.mito variable during scaling.this case we have as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much celltype signature left in those.at the plots, make resonable decisions on where to draw the cutoff. In this case, the bulk of the cells are below 25% mitochondrial reads and that will be used as a cutoff.

selected <- WhichCells(alldata, expression = percent_mito < 25)
length(selected)

# and subset the object to only keep those cells
data.filt <- subset(alldata, cells = selected)

# plot violins for new data
VlnPlot(data.filt, features = "percent_mito")

## [1] 2931

As you can see, there is still quite a lot of variation in percent mito, so it will have to be dealt with in the data analysis step.

Gene detection filtering

Extremely high number of detected genes could indicate doublets. However, depending on the celltype composition in your sample, you may have cells with higher number of genes (and also higher counts) from one celltype. In these datasets, there is also a clear difference between the v2 vs v3 10x chemistry with regards to gene detection, so it may not be fair to apply the same cutoffs to all of them. Also, in the protein assay data there is a lot of cells with few detected genes giving a bimodal distribution. This type of distribution is not seen in the other 2 datasets. Considering that they are all pbmc datasets it makes sense to regard this distribution as low quality libraries. Filter the cells with high gene detection (putative doublets) with cutoffs 4100 for v3 chemistry and 2000 for v2.

#start with cells with many genes detected.
high.det.v3 <- WhichCells(data.filt, expression = nFeature_RNA > 4100)
high.det.v2 <- WhichCells(data.filt, expression = nFeature_RNA > 2000 & orig.ident == "v2.1k")

# remove these cells
data.filt <- subset(data.filt, cells=setdiff(WhichCells(data.filt),c(high.det.v2,high.det.v3)))

# check number of cells
ncol(data.filt)
## [1] 2859

Filter the cells with low gene detection (low quality libraries) with less than 1000 genes for v2 and < 500 for v2.

Plot fitered QC

Lets plot the same qc-stats another time.

feats <- c("nFeature_RNA","nCount_RNA","percent_mito","percent_ribo")
VlnPlot(data.filt, group.by= "orig.ident", features = feats, pt.size = 0.1,ncol = 4) + NoLegend()

VlnPlot(data.filt, group.by= "Chemistry", features = feats, pt.size = 0.1,ncol = 4) + NoLegend()